Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multitenant migrations execution #431

Merged
merged 4 commits into from
Feb 9, 2024
Merged

Conversation

fenos
Copy link
Contributor

@fenos fenos commented Jan 29, 2024

What kind of change does this PR introduce?

Feature

What is the current behavior?

Currently, migrations on tenants are run on the first request from when the server starts.
With this approach, there are a few different drawbacks:

  • Expensive migrations (ex adding an index) cannot be run as this impacts the response time
  • Build up connections to the database when concurrent requests come in

What is the new behavior?

This PR introduces 3 migrations strategy modes:

  • on_request
  • full_fleet
  • progressive

Migrations are now tracked in the multi-tenant db with migrations_version and migrations_status.
In all modes, migrations will be run only if a tenant is not on the latest version.

On Request

on_request mode will run the migrations to a tenant database during a request. Migrations will block the request lifecycle until completed.

In case the migration fails, the request will also fail with a 500 error.

Full Fleet

This mode offers a background scheduler to run migrations across all tenants asynchronously.

On Server Start, the instances race to acquire an advisory lock.
The instance that acquires the lock is responsible for selecting all tenants that need a migration to run.

We emit RunMigrationsOnTenant job for each tenant in a batch, we use a singleton key to make sure only 1 job for this specific tenant can exist at any one time.

Each Storage instance can consume the events and run the migrations on each single tenant database.
Migrations are retried automatically 3 times.

Failures don't impact the request lifecycle and the migrations will run on all tenants that are not on the latest version.

Progressive

This mode is a hybrid version of the 2 above. It works similarly to the full_fleet mode since migrations are run with an async job, but the tenants are selected during a request.

During a tenant request, we determine if the tenant needs a migration to run.
If it does we register the tenantId in a list in memory.

A background worker will pop a batch of 200 tenants from this list and schedule RunMigrationsOnTenant jobs every 1m.

Deduping between instances is done via the singletonKey and early checks.

In case of failure, migrations are retried 3 times, after which the status becomes FAILED_STALE.
When using progressive migrations and the staus is FAILED_STALE we will continue running the migrations during the request lifecycle without blocking the request for this tenant, with the hope that it recovers from unexpected prolonged issues to the underline database.

Additional context

With this change, we will see the following benefits:

  • Increased performance on req/res time on server restart
  • Fewer database connections to tenant databases
  • Possibility to run long-running async migrations

@fenos fenos force-pushed the multitenant/migrations-execution branch 2 times, most recently from 9075ee7 to 30370d1 Compare January 30, 2024 11:08
@coveralls
Copy link

coveralls commented Jan 30, 2024

Pull Request Test Coverage Report for Build 7842314553

  • -676 of 1052 (35.74%) changed or added relevant lines in 33 files are covered.
  • 17 unchanged lines in 5 files lost coverage.
  • Overall coverage decreased (-3.7%) to 78.602%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/http/routes/health/healthcheck.ts 0 1 0.0%
src/queue/workers.ts 1 2 50.0%
src/storage/tus/postgres-locker.ts 0 1 0.0%
src/pubsub/postgres.ts 13 15 86.67%
src/monitoring/logger.ts 21 24 87.5%
src/database/pubsub.ts 3 8 37.5%
src/concurrency/mutex.ts 7 28 25.0%
src/http/plugins/log-request.ts 5 26 19.23%
src/http/routes/admin/migrations.ts 17 39 43.59%
src/http/routes/admin/tenants.ts 40 86 46.51%
Files with Coverage Reduction New Missed Lines %
src/monitoring/logger.ts 1 90.59%
src/queue/workers.ts 1 37.5%
src/queue/queue.ts 2 15.05%
src/database/tenant.ts 3 49.66%
src/storage/errors.ts 10 67.97%
Totals Coverage Status
Change from base Build 7842206814: -3.7%
Covered Lines: 8197
Relevant Lines: 10369

💛 - Coveralls

@fenos fenos force-pushed the multitenant/migrations-execution branch from 30370d1 to 5e47df6 Compare January 30, 2024 12:03
@fenos fenos force-pushed the multitenant/migrations-execution branch 2 times, most recently from d350181 to 687e60e Compare February 5, 2024 13:50
@@ -1,24 +1,36 @@
FROM node:18-alpine
# Base stage for shared environment setup
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved Dockerfile to use node_modules cache on each step

#######################################
# MULTI_TENANT=true
DATABASE_MULTITENANT_URL=postgresql://postgres:[email protected]:5433/postgres
REQUEST_X_FORWARDED_HOST_REGEXP=
REQUEST_X_FORWARDED_HOST_REGEXP=^([a-z]{20}).local.(?:com|dev)$
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default value for multitenant x-forwarded-host: {tenant-id}.local.dev to make it easy to run Storage out of the box as multitenant

@@ -85,7 +78,7 @@ STORAGE_BACKEND=s3
#######################################
# S3 Backend
#######################################
STORAGE_S3_BUCKET=name-of-your-s3-bucket
STORAGE_S3_BUCKET=supa-storage-bucket
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default value for bucket name which is the same as the bucket we create on the docker-compose.
This makes it easy to simply rename the .env.sample to .env without any additional changes

@@ -23,9 +23,10 @@ A scalable, light-weight object storage service.

- Copy `.env.sample` to `.env` file.
- Copy `.env.test.sample` to `.env.test`.
- Change `GLOBAL_S3_BUCKET` and `REGION` to the name and region of a S3 bucket.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now simply rename .env.sample to .env and that's it, all default values will work out of the box

@@ -258,8 +261,9 @@ export function getConfig(options?: { reload?: boolean }): StorageConfigType {
logflareEnabled: getOptionalConfigFromEnv('LOGFLARE_ENABLED') === 'true',
logflareApiKey: getOptionalConfigFromEnv('LOGFLARE_API_KEY'),
logflareSourceToken: getOptionalConfigFromEnv('LOGFLARE_SOURCE_TOKEN'),
defaultMetricsEnabled:
getOptionalConfigFromEnv('DEFAULT_METRICS_ENABLED', 'ENABLE_DEFAULT_METRICS') === 'true',
defaultMetricsEnabled: !(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default metrics are now on by default

@@ -346,7 +396,7 @@ async function refreshMigrationPosition(
if (shouldUpdateMigrations) {
await client.query(`BEGIN`)
try {
await client.query(`DELETE FROM ${migrationTableName}`)
await client.query(`DELETE FROM ${migrationTableName} WHERE id is not NULL`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

running deletes without where statements seems to be blocked by default on staging / prod with a pg extension

if (!pgQueueEnable) {
return reply.code(400).send({ message: 'Queue is not enabled' })
}
const queueSize = await Queue.getInstance().getQueueSize(RunMigrationsOnTenants.getQueueName())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

admin route to see how many remaining migrations are pending at: GET /admin/migrations

payload:

{
  "remaining": 10
}

const is4xxResponse = statusOfType(log.res.statusCode, 400)
const is5xxResponse = statusOfType(log.res.statusCode, 500)

const logLevel = is4xxResponse ? 'warn' : is5xxResponse ? 'error' : 'info'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to @egor-romanov to have pointed out that request logs were always info.
This changes the log level:

  • anything 4xx is a warning
  • anything 5xx is an error

return {}
}

static batchSend<T extends BaseEvent<any>[]>(messages: T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented batch insert for inserting multiple jobs in one insert statement

tenantId: string
}

export class RunMigrationsOnTenants extends BaseEvent<RunMigrationsPayload> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Job that handles tenant migrations

src/server.ts Outdated
@@ -28,6 +33,20 @@ const exposeDocs = true
if (isMultitenant) {
await runMultitenantMigrations()
await listenForTenantUpdate(PubSub)

runMigrationsOnAllTenants()
Copy link
Contributor Author

@fenos fenos Feb 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on server start up, all storage api instances will race to acquire the lock.
The one who acquired the lock will create the jobs for running the migrations, the other instances who failed will simply return and do nothing.

@fenos fenos force-pushed the multitenant/migrations-execution branch 11 times, most recently from 707b53f to 186f5e8 Compare February 8, 2024 16:54
@fenos fenos force-pushed the multitenant/migrations-execution branch from 186f5e8 to e291a85 Compare February 9, 2024 09:38
@fenos fenos force-pushed the multitenant/migrations-execution branch from ca4dfb8 to 5ba7cc1 Compare February 9, 2024 09:56
return
}
const result = await multitenantKnex.raw(`SELECT pg_try_advisory_lock(?);`, [
'-8575985245963000605',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whats does this number represent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the lock id

@fenos fenos merged commit 83ad222 into master Feb 9, 2024
1 check passed
@fenos fenos deleted the multitenant/migrations-execution branch February 9, 2024 10:41
Copy link

github-actions bot commented Feb 9, 2024

🎉 This PR is included in version 0.48.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants